Before we create visualisations, it’s important to understand the principles of data visualisation:
Common visualisation types include:
- Scatter plots: For visualising the relationship between two continuous variables.
- Bar plots: For comparing categorical data.
- Histograms: For displaying the distribution of a numeric variable.
- Line plots: For trends over time or continuous data.
Misleading visualisations can distort the interpretation of data by manipulating elements like scales, axes, or colors. One common tactic is truncating the y-axis, which exaggerates small differences and makes them appear more significant than they are. Another technique is inconsistent scaling, where proportions are not accurately represented, leading to misinformed conclusions. It is essential to use clear and honest visual representations to ensure that the data is communicated accurately without introducing bias or confusion.
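The effect of a truncated y-axis can be sketched with a few lines of ggplot2. The counts below are hypothetical values invented for illustration, and the object names (crimes, p_truncated, p_zero) are assumptions rather than anything from a real dataset. Here coord_cartesian() is used to zoom the axis, which is one possible way of producing a truncated scale.

```r
library(ggplot2)

# Hypothetical counts, invented purely for illustration
crimes <- data.frame(
  year  = factor(c(2010, 2011)),
  count = c(5000, 5100)
)

# Relative change between the two years, in percent
pct_change <- 100 * (crimes$count[2] - crimes$count[1]) / crimes$count[1]

# Truncated y-axis: the visible range starts at 4950, so a small rise looks dramatic
p_truncated <- ggplot(crimes, aes(x = year, y = count)) +
  geom_col(fill = "steelblue") +
  coord_cartesian(ylim = c(4950, 5150)) +
  labs(title = "Truncated y-axis", x = "Year", y = "Number of crimes")

# Zero-based y-axis: same data, bar heights proportional to the counts
p_zero <- ggplot(crimes, aes(x = year, y = count)) +
  geom_col(fill = "steelblue") +
  labs(title = "Zero-based y-axis", x = "Year", y = "Number of crimes")

print(pct_change)
print(p_truncated)
print(p_zero)
```

Both plots show identical data; only the visible range of the y-axis differs, which is what changes the impression a reader takes away.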
Let’s say we want to report on whether the number of crimes has increased between two selected years.
From the above plot, it seems that the crime rate has jumped significantly from 2010 to 2011. However, can you notice anything suspicious about the plot?
Now, if we set the scale to start at zero we get:
Now we can see that in reality, the crime rate has only increased marginally. This is a common tactic used in politics and the news when reporting. For example:
However, a truncated scale is not always a bad thing. In the example below, we show the IQ for three individuals. Setting the scale at zero, it appears that all three people have a very similar IQ:
However, since IQ is sensitive to small changes, it makes more sense to view the differences on a smaller scale to get a true picture of the differences between the people:
A famous example of a misleading plot was published by the Reuters news agency in a report about gun safety in Florida. What do you notice that is unusual about this plot?
See more: https://www.nytimes.com/column/whats-going-on-in-this-graph
Before we begin creating visualisations, let’s start by clearing the R environment and loading a dataset that contains a variety of column types.
To ensure we start fresh, let’s clear the environment by removing all existing objects.
# Clear the environment
rm(list = ls())
We will use the built-in diamonds dataset from the
ggplot2 package, which contains 53,940 rows and 10 columns.
This dataset includes both numeric and
factor columns, making it suitable for various types of
visualisation.
First, let’s load the ggplot2 package and then the
diamonds dataset.
# Load the ggplot2 package
library(ggplot2)
# Load the diamonds dataset
data(diamonds)
# View the first few rows of the dataset
head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
# Check the structure of the dataset to see the different column types
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Explanation:
- data(diamonds): Loads the diamonds dataset, which contains 53,940 observations and 10 variables related to diamonds, including price, carat, cut, color, and clarity.
- head(diamonds): Displays the first 6 rows of the dataset.
- str(diamonds): Shows the structure of the dataset, with several factors (e.g., cut, color, clarity) and numeric columns (e.g., price, carat).
As we can see, some of the variables are ordered factors.
An ordered factor is a special type of factor whose levels
have a meaningful order or hierarchy. It is used when the
categories can be ranked in a logical sequence, such as
“low”, “medium”, and “high”.
Key Differences Between Regular Factor and Ordered Factor
1. Order of Levels:
• Regular Factor: The levels are unordered, meaning that R does not assume any relationship between the categories.
• Example: Colors (red, blue, green) don’t have a natural order, so they would be a regular factor.
• Ordered Factor: The levels have a specific order, and R will treat them as ranked.
• Example: Education levels (high school, bachelor's, master's, PhD) have a natural order, so they would be an ordered factor.
2. Comparison:
• Regular Factor: Categories cannot be meaningfully compared using greater than (>) or less than (<) operators. Trying to do so returns NA with a warning.
• Ordered Factor: Since the levels have a meaningful order, they can be compared using greater than (>) or less than (<) operators.
• For example, bachelor's > high school would return TRUE for an ordered factor.
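The difference can be checked directly in base R. This is a minimal sketch; the vectors colours and education are illustrative names chosen to mirror the examples above, not part of the diamonds dataset.

```r
# Regular (unordered) factor: no ranking between levels
colours <- factor(c("red", "blue", "green"))

# Ordered factor: levels listed from lowest to highest
education <- factor(
  c("high school", "bachelor's", "master's", "PhD"),
  levels  = c("high school", "bachelor's", "master's", "PhD"),
  ordered = TRUE
)

is.ordered(colours)    # FALSE
is.ordered(education)  # TRUE

# Comparisons follow the level order for an ordered factor
ordered_result <- education[2] > education[1]  # bachelor's > high school

# For a regular factor the comparison is not meaningful:
# R returns NA and issues a warning (suppressed here)
unordered_result <- suppressWarnings(colours[1] < colours[2])

print(ordered_result)
print(unordered_result)
```

The same check, is.ordered(diamonds$cut), can be applied to the factor columns of the diamonds dataset once ggplot2 is loaded.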
Let’s begin by creating some basic plots using base R to understand
fundamental visualisation techniques. We’ll use the
diamonds dataset.
A scatter plot shows the relationship between two continuous
variables. Let’s create a scatter plot between carat
(diamond weight) and price.
# Scatter plot of carat vs price
plot(diamonds$carat, diamonds$price,
     main = "Scatter Plot of Carat vs Price",
     xlab = "Carat",
     ylab = "Price",
     col = "blue",
     pch = 19)
Explanation:
The plot() function is used to create a scatter plot, with
carat on the x-axis and price on the y-axis.
The points are colored blue (col = "blue") and use solid
circles (pch = 19).
A bar plot visualises the frequency of categories in a factor
variable. Let’s create a bar plot for the cut variable,
which represents the quality of the diamond’s cut.
# Bar plot of the frequency of diamond cut
barplot(table(diamonds$cut),
        main = "Bar Plot of Diamond Cut",
        xlab = "Cut",
        ylab = "Frequency",
        col = "lightblue")
Explanation:
We use table() to count the frequency of each level in the
cut variable and barplot() to visualise the
distribution of cut quality. The bars are colored light blue
(col = "lightblue").
A histogram helps visualise the distribution of a continuous
variable. Let’s create a histogram for the price of
diamonds.
# Histogram of diamond prices
hist(diamonds$price,
     main = "Histogram of Diamond Prices",
     xlab = "Price",
     col = "orange",
     border = "black")
Explanation:
The hist() function creates a histogram of
price to show its distribution. The bars are colored orange
(col = "orange"), and border = "black" adds
black borders around the bars.
Although the diamonds dataset doesn’t include a time
series, a line plot can still be used to show trends. Here, we’ll plot
the average price of diamonds by carat.
# Line plot of average price by carat
avg_price <- aggregate(price ~ carat, data = diamonds, FUN = mean)
plot(avg_price$carat, avg_price$price,
     type = "l",
     main = "Line Plot of Average Price by Carat",
     xlab = "Carat",
     ylab = "Average Price",
     col = "blue",
     lwd = 2)
Explanation:
We use aggregate() to calculate the mean price for each
distinct carat value and plot it using plot() with
type = "l" to create a line plot. The line is blue
(col = "blue") and its width is increased with
lwd = 2.
The ggplot2 package is a powerful and
flexible visualisation tool in R. It follows the grammar of
graphics, where plots are built layer by layer, allowing for
complex and highly customised visualisations.
A basic ggplot consists of the following components:
- Data: The dataset to visualise.
- Aesthetic Mappings: Map data to visual properties (e.g., x, y, color).
- Geometries (Geoms): The type of plot to create (e.g., points, bars, lines).
- Layers: Additional layers such as titles, labels, themes.
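The components listed above can be added one at a time, which makes the layered structure visible. The following is a minimal sketch using the diamonds dataset; the object name p and the choice of alpha = 0.3 and theme_minimal() are illustrative assumptions.

```r
library(ggplot2)

# 1. Data and aesthetic mappings: an empty canvas with axes only
p <- ggplot(data = diamonds, aes(x = carat, y = price))

# 2. Geometry: draw one point per diamond
p <- p + geom_point(alpha = 0.3)

# 3. Additional layers: labels and a theme
p <- p +
  labs(title = "Price vs Carat", x = "Carat", y = "Price") +
  theme_minimal()

print(p)
```

Printing p after each step shows how the plot changes as each component is added.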
Let’s create a scatter plot of carat vs
price, similar to the base R example, but using
ggplot2.
# Scatter plot of carat vs price using ggplot2
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue") +
  labs(title = "Scatter Plot of Carat vs Price", x = "Carat", y = "Price")
Explanation:
In this example, we use ggplot() to define the data and
aesthetics (aes()). The geom_point() function
adds points to create a scatter plot, and labs() is used to
add a title and axis labels.
Let’s recreate the bar plot of the cut variable using
ggplot2.
# Bar plot of diamond cut using ggplot2
ggplot(data = diamonds, aes(x = cut)) +
  geom_bar(fill = "lightblue") +
  labs(title = "Bar Plot of Diamond Cut", x = "Cut", y = "Frequency")
Explanation:
We use geom_bar() to create a bar plot of the
cut variable. The fill argument sets the color
of the bars, and labs() is used to add the title and axis
labels.
Let’s create a histogram of price using
ggplot2.
# Histogram of diamond prices using ggplot2
ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 1000, fill = "orange", color = "black") +
  labs(title = "Histogram of Diamond Prices", x = "Price", y = "Count")
Explanation:
We use geom_histogram() to create a histogram of the
price variable. The binwidth parameter
controls the width of the bins, fill sets the
interior colour of the bars, and color sets their outline.
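The choice of bin width changes how the distribution appears. As a rough sketch, the same histogram can be drawn with a narrow and a wide bin width, or with a fixed number of bins through the bins argument. The values 250, 2500 and 30 are arbitrary choices for illustration.

```r
library(ggplot2)

# Narrow bins (250 US dollars wide): more detail, but a noisier shape
p_narrow <- ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 250, fill = "orange", color = "black") +
  labs(title = "Bin width = 250", x = "Price", y = "Count")

# Wide bins (2500 US dollars wide): smoother, but detail is lost
p_wide <- ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 2500, fill = "orange", color = "black") +
  labs(title = "Bin width = 2500", x = "Price", y = "Count")

# Alternative: fix the number of bins instead of their width
p_bins <- ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(bins = 30, fill = "orange", color = "black") +
  labs(title = "30 bins", x = "Price", y = "Count")

print(p_narrow)
print(p_wide)
print(p_bins)
```

Whichever setting is used, every diamond is counted exactly once; only the grouping of prices into bars changes.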
Customising plots in ggplot2 allows you to tailor the
appearance of your visualisations. You can adjust elements such as
titles, axis labels, themes, scales, and even create subplots
(faceting).
You can use labs() to add or modify titles, axis labels,
and captions for your plots.
# Customising titles and axis labels
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue") +
  labs(
    title = "Scatter Plot of Carat vs Price",
    subtitle = "Data from the diamonds dataset",
    x = "Carat (weight of diamond)",
    y = "Price (in US dollars)",
    caption = "Source: ggplot2 diamonds dataset"
  )
Explanation:
- title and subtitle: Add a main title and a subtitle to the plot.
- x and y: Customise the axis labels.
- caption: Add a caption at the bottom of the plot.
ggplot2 provides several pre-built themes that can
change the overall look of your plots. You can also customise these
themes further.
# Applying different themes
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue") +
  labs(title = "Scatter Plot with Custom Theme") +
  theme_bw() # Apply the bw theme
Common themes in ggplot2 include:
- theme_minimal(): A clean, minimalistic theme.
- theme_classic(): A theme with classic x and y axis lines.
- theme_bw(): A black and white theme.
Explanation:
Themes allow you to quickly change the appearance of your plot without
manually customising each element. You can also modify individual
elements of a theme using theme().
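As a small sketch of modifying individual theme elements, theme() can be added after a pre-built theme to override specific settings. The particular elements changed here (title alignment, axis title size, minor grid lines) are arbitrary examples.

```r
library(ggplot2)

p_themed <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue") +
  labs(title = "Scatter Plot with a Modified Theme") +
  theme_bw() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5), # Bold, centred title
    axis.title = element_text(size = 12),                  # Larger axis titles
    panel.grid.minor = element_blank()                     # Remove minor grid lines
  )

print(p_themed)
```

The call to theme() comes after theme_bw() because a complete theme added later would overwrite the individual adjustments.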
You can adjust scales for axes, colours, and sizes to better represent your data. This includes modifying axis limits, colour schemes, and scale transformations (e.g., log scale).
# Scatter plot with customised scales
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut)) +             # Colour points by cut
  labs(title = "Price vs Carat Coloured by Cut") +
  scale_x_continuous(limits = c(0, 3)) +     # Set limits for the x-axis
  scale_color_brewer(palette = "Dark2") +    # Apply a custom colour palette
  theme_bw()
## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_point()`).
Explanation:
- scale_x_continuous(): Sets the limits for the x-axis (in this case, carat is limited to between 0 and 3).
- scale_color_brewer(): Changes the colour scheme using a pre-defined palette from the RColorBrewer package.
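Two related options are worth sketching here. A log transformation can be applied with scale_x_log10() and scale_y_log10(), and coord_cartesian() zooms into a range without discarding the rows outside it, which avoids the "Removed rows" warning that scale limits produce. The object names below are illustrative assumptions.

```r
library(ggplot2)

# Log-transformed axes: spreads out the dense cluster of small, cheap diamonds
p_log <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut), alpha = 0.3) +
  scale_x_log10() +
  scale_y_log10() +
  scale_color_brewer(palette = "Dark2") +
  labs(title = "Price vs Carat on Log Scales") +
  theme_bw()

# Zooming with coord_cartesian(): rows above 3 carats are kept, just not shown
p_zoom <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut), alpha = 0.3) +
  coord_cartesian(xlim = c(0, 3)) +
  scale_color_brewer(palette = "Dark2") +
  labs(title = "Price vs Carat, Zoomed to 0-3 Carats") +
  theme_bw()

# Number of diamonds heavier than 3 carats, i.e. the rows that
# scale_x_continuous(limits = c(0, 3)) would drop
n_outside <- sum(diamonds$carat > 3)

print(n_outside)
print(p_log)
print(p_zoom)
```

The distinction matters most when a statistical layer such as geom_smooth() is present: scale limits remove data before the statistic is computed, whereas coord_cartesian() only changes the visible window.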
Faceting allows you to create multiple plots based on a categorical variable, essentially creating subplots that share the same axes and scales.
# Facet scatter plot by cut
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut)) +
  labs(title = "Faceted Plot of Price vs Carat by Cut") +
  facet_wrap(~cut) + # Create a facet for each level of cut
  theme_bw()
Explanation:
- facet_wrap(~cut): Creates a separate scatter plot for each level of the cut variable, arranging them in a grid format.
Faceting is useful for comparing how a relationship changes across different categories (e.g., different cuts of diamonds).
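Facets share their axes by default, but this can be relaxed, and two variables can be crossed with facet_grid(). The sketch below shows both options; the choice of scales = "free_y" and of color as the second faceting variable is an assumption made for illustration.

```r
library(ggplot2)

# Let each panel choose its own y-axis range
p_free <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~cut, ncol = 3, scales = "free_y") +
  labs(title = "Price vs Carat by Cut (Free y-axis)") +
  theme_bw()

# Cross two categorical variables: one row per cut, one column per colour grade
p_grid <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.3) +
  facet_grid(rows = vars(cut), cols = vars(color)) +
  labs(title = "Price vs Carat by Cut and Colour") +
  theme_bw()

print(p_free)
print(p_grid)
```

Free scales make the pattern within each panel easier to see, at the cost of making direct comparisons between panels harder.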
Annotations allow you to add text, arrows, or shapes to highlight
specific points or areas of interest in your plots. You can use
annotate() to add annotations to your ggplot2
visualisations.
# Scatter plot with annotation
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue") +
  labs(title = "Annotated Scatter Plot of Carat vs Price") +
  # Add a text annotation
  annotate("text", x = 1, y = 15000, label = "High Value", color = "black", size = 10) +
  # Highlight an area
  annotate("rect", xmin = 1, xmax = 1.5, ymin = 5000, ymax = 10000, alpha = 0.5, fill = "yellow") +
  theme_bw()
Explanation:
- annotate("text"): Adds text at the specified x and y coordinates.
- annotate("rect"): Draws a shaded rectangle to highlight a region of the plot, with adjustable transparency using alpha.
Error bars are used to represent the uncertainty or variability of
the data. In ggplot2, you can add error bars to line or bar
plots using geom_errorbar().
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Bar plot with error bars
avg_price <- diamonds %>%
  group_by(cut) %>%
  summarise(mean_price = mean(price), sd_price = sd(price)) # Calculate mean and standard deviation
ggplot(avg_price, aes(x = cut, y = mean_price, fill = cut)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean_price - sd_price, ymax = mean_price + sd_price), width = 0.2) +
  labs(title = "Bar Plot of Mean Price by Cut with Error Bars", x = "Cut", y = "Mean Price") +
  theme_bw()
Explanation:
- geom_errorbar(): Adds vertical error bars showing the standard deviation of price within each cut.
- ymin and ymax: Set the lower and upper limits of the error bars to the mean price ± one standard deviation.
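Error bars based on the standard deviation describe the spread of prices, not the uncertainty of the mean. If the aim is to show how precisely the mean is estimated, the standard error (standard deviation divided by the square root of the group size) is one common alternative. The sketch below assumes this substitution; the names price_summary and se are illustrative.

```r
library(ggplot2)
library(dplyr)

# Mean, standard deviation, group size and standard error of price per cut
price_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    mean_price = mean(price),
    sd_price   = sd(price),
    n          = n(),
    se         = sd_price / sqrt(n)
  )

# Bar plot with error bars of mean ± one standard error
p_se <- ggplot(price_summary, aes(x = cut, y = mean_price, fill = cut)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean_price - se, ymax = mean_price + se), width = 0.2) +
  labs(title = "Mean Price by Cut with Standard Error Bars", x = "Cut", y = "Mean Price") +
  theme_bw()

print(price_summary)
print(p_se)
```

Because each cut contains many diamonds, the standard error bars are much shorter than the standard deviation bars, so a plot should always state which of the two it shows.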
To remove the legend, we can adjust its setting via the
theme() function.
library("dplyr")
# Bar plot with error bars and the legend removed
avg_price <- diamonds %>%
  group_by(cut) %>%
  summarise(mean_price = mean(price), sd_price = sd(price)) # Calculate mean and standard deviation
ggplot(avg_price, aes(x = cut, y = mean_price, fill = cut)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean_price - sd_price, ymax = mean_price + sd_price), width = 0.2) +
  labs(title = "Bar Plot of Mean Price by Cut with Error Bars", x = "Cut", y = "Mean Price") +
  theme_bw() +
  theme(legend.position = 'none') # Removes the legend
One of the strengths of ggplot2 is its ability to add
multiple layers to a single plot. This allows you to combine different
geometries, such as points and lines, on the same plot.
# Scatter plot with a regression line and points
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue", alpha = 0.5) +               # Add points with transparency
  geom_smooth(method = "lm", color = "red", se = FALSE) + # Add a regression line
  labs(title = "Scatter Plot of Carat vs Price with Multiple Layers", x = "Carat", y = "Price") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
Explanation:
- geom_point(): Plots the scatter points.
- geom_smooth(): Adds a regression line (linear model).
- By layering multiple geoms, you can easily combine different visualisation elements into a single plot.
Once you’ve created a plot, you can save it to a file using
ggsave().
# Save the last plot as a PNG file
ggsave("scatter_plot.png", width = 7, height = 5)
Explanation:
By default, the ggsave() function saves the last plot displayed. You can
specify the filename and the dimensions of the output; the file format is
inferred from the filename extension.
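Relying on the last plot displayed can be fragile in longer scripts, so it is often safer to pass the plot object explicitly. The sketch below assumes a plot stored in p and writes to a temporary directory; the file name and the dpi value are arbitrary choices.

```r
library(ggplot2)

# Store the plot in an object so it can be passed to ggsave() explicitly
p <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue") +
  labs(title = "Scatter Plot of Carat vs Price", x = "Carat", y = "Price")

# Write to a temporary directory so no files are left in the project folder
out_file <- file.path(tempdir(), "carat_vs_price.png")

# Save the named plot; width and height are in inches, dpi sets the resolution
ggsave(filename = out_file, plot = p, width = 7, height = 5, units = "in", dpi = 300)

# Confirm that the file was written
print(file.exists(out_file))
```

Changing the extension in the file name (for example to .pdf) is enough to switch the output format.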
Interactive visualisations allow users to explore the data by
hovering over points, zooming in, and panning around the plot. The
plotly package can easily turn static ggplot2
plots into interactive graphics.
First, you need to install and load the plotly
package.
# Install plotly
#install.packages("plotly")
# Load the plotly library
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
To make a ggplot2 plot interactive, simply pass the plot
object to the ggplotly() function.
Let’s convert a scatter plot of carat vs
price from the diamonds dataset into an
interactive plot.
# Create a ggplot2 scatter plot
p <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue") +
  labs(title = "Interactive Scatter Plot of Carat vs Price", x = "Carat", y = "Price") +
  theme_bw()
# Convert to interactive plot using ggplotly
ggplotly(p)
Explanation:
- ggplotly(p): Converts the ggplot2 plot p into an interactive plot. You can now hover over points to see their values and zoom in/out of the plot.
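With all 53,940 diamonds, an interactive scatter plot can be slow to render in a browser. One possible workaround, sketched below, is to plot a random sample and restrict the hover text with the tooltip argument of ggplotly(). The sample size of 2,000 and the seed are arbitrary assumptions.

```r
library(ggplot2)
library(plotly)

# Draw a reproducible random sample of rows to keep the widget responsive
set.seed(42)
diamonds_sample <- diamonds[sample(nrow(diamonds), 2000), ]

# Static plot built on the sample
p_small <- ggplot(data = diamonds_sample, aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.6) +
  labs(title = "Interactive Scatter Plot (Random Sample)", x = "Carat", y = "Price") +
  theme_bw()

# Show only the x and y values when hovering over a point
p_interactive <- ggplotly(p_small, tooltip = c("x", "y"))

p_interactive
```

Sampling changes what is displayed, so it suits exploration better than reporting; the full data should still be used for any summary statistics.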
Let’s convert a bar plot of the cut variable from the
diamonds dataset into an interactive bar plot.
# Create a ggplot2 bar plot
p_bar <- ggplot(data = diamonds, aes(x = cut, fill = cut)) +
  geom_bar() +
  labs(title = "Interactive Bar Plot of Diamond Cut", x = "Cut", y = "Count") +
  theme_bw() +
  theme(legend.position = 'none')
# Convert to interactive plot
ggplotly(p_bar)
Explanation:
This converts a bar plot into an interactive format. You can hover over
the bars to see the counts of each cut category.
The plotly package allows you to easily convert static
ggplot2 plots into interactive visualisations. By simply
using the ggplotly() function, your plots become
interactive, enabling zooming, panning, and hovering over points to
display additional information.